{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "55e51f55-ce98-4cbd-86a2-a433a7d51d3c",
   "metadata": {},
   "source": [
    "# Using GPT-4 to bootstrap few-shot CoT demonstations for GPT-3.5"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "994104c5-7564-4aef-a6a2-e44d0049a23b",
   "metadata": {},
   "source": [
    "The [Scoped Negation (ScoNe) benchmark of She et al. (2023)](https://aclanthology.org/2023.acl-short.154/) seeks to stress-test models on their ability to reason about negation. In the original paper, the `text-davinci-002` and `text-davinci-003` models were more or less at chance on the hardest ScoNe categories.\n",
    "\n",
    "This notebook starts with a very simple Chain-of-Thought-based module for ScoNe. `gpt-3.5-turbo` is at chance on the \"one scoping negation\" category (one of the two hardest in ScoNe) using this simple program. \n",
    "\n",
    "We figured that bootstrapping demonstrations would help, but `turbo` struggled to create good demonstrations that included CoT steps. When we switched to using `gpt4-turbo` just to create these demonstrations (which involves under 50 calls to that model), `turbo` regularly achieved 85–90% accuracy. **This is a single compilation step using `dspy.BootstrapFewShotWithRandomSearch`.**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d155ab52-564d-4a20-8973-be86a82a231a",
   "metadata": {},
   "source": [
    "## Set-up"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "fb53ae9e-532a-4fb5-8099-4d21738538f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import glob\n",
    "import os\n",
    "import pandas as pd\n",
    "import random\n",
    "\n",
    "import dspy\n",
    "from dspy.evaluate import Evaluate\n",
    "from dspy.teleprompt import BootstrapFewShotWithRandomSearch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "96e24f6b-67e5-4b20-b9d4-0e6f0d7f06c7",
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ[\"DSP_NOTEBOOK_CACHEDIR\"] = os.path.join('.', 'cache')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "92f83973-1e23-476f-bf02-c80f30a9911f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We'll rely on turbo for everything except bootstrapping CoT demos:\n",
    "\n",
    "turbo = dspy.OpenAI(model='gpt-3.5-turbo-1106', max_tokens=250, model_type='chat')\n",
    "\n",
    "dspy.settings.configure(lm=turbo)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d0680085-df6d-4729-b952-6a5a69b00519",
   "metadata": {},
   "outputs": [],
   "source": [
    "# GPT-4 will be used only to bootstrap CoT demos:\n",
    "\n",
    "gpt4T = dspy.OpenAI(model='gpt-4-1106-preview', max_tokens=350, model_type='chat')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "17761e10-01b1-414f-962c-5efdbdb81fcc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toggling this to true will redo the bootstrapping process. When\n",
    "# it is set to False, the existing demonstrations will be used but\n",
    "# turbo will still be used to evaluate the zero-shot and full programs.\n",
    "RUN_FROM_SCRATCH = False"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9b08d0c-e6df-4d6e-8aed-a0eeb08039fd",
   "metadata": {},
   "source": [
    "## ScoNe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "68c340a3-b2a7-4b53-aff7-636ad36dddca",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cloning into 'ScoNe'...\n",
      "remote: Enumerating objects: 77, done.\u001b[K\n",
      "remote: Counting objects: 100% (77/77), done.\u001b[K\n",
      "remote: Compressing objects: 100% (55/55), done.\u001b[K\n",
      "remote: Total 77 (delta 42), reused 42 (delta 20), pack-reused 0\u001b[K\n",
      "Receiving objects: 100% (77/77), 116.25 KiB | 1.21 MiB/s, done.\n",
      "Resolving deltas: 100% (42/42), done.\n"
     ]
    }
   ],
   "source": [
    "!git clone https://github.com/selenashe/ScoNe.git"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2973a74-63a4-4913-ae3b-9a9ffb631293",
   "metadata": {},
   "source": [
    "### Data loader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "4fb01601-ee8e-4b92-902b-18a89236e750",
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_scone(dirname):\n",
    "    dfs = []\n",
    "    for filename in glob.glob(dirname + \"/*.csv\"):\n",
    "        df = pd.read_csv(filename, index_col=0)\n",
    "        df['category'] = os.path.basename(filename).replace(\".csv\", \"\")\n",
    "        dfs.append(df)\n",
    "    data_df = pd.concat(dfs)\n",
    "\n",
    "    def as_example(row):\n",
    "        # The 'one_scoped' file is from an earlier dataset, MoNLI, and\n",
    "        # so is formatted a bit differently:\n",
    "        suffix = '' if row['category'] == 'one_scoped' else '_edited'\n",
    "        # Reformat the hypothesis to be an embedded clause in a question:\n",
    "        hkey = 'sentence2' + suffix\n",
    "        question = row[hkey][0].lower() + row[hkey][1: ].strip(\".\")\n",
    "        question = f\"Can we logically conclude for sure that {question}?\"\n",
    "        # Binary task formulation:\n",
    "        label = \"Yes\" if row['gold_label' + suffix] == 'entailment' else \"No\"\n",
    "        return dspy.Example({\n",
    "            \"context\": row['sentence1' + suffix],\n",
    "            \"question\": question,\n",
    "            \"answer\": label,\n",
    "            \"category\": row['category']\n",
    "        }).with_inputs(\"context\", \"question\")\n",
    "\n",
    "    return list(data_df.apply(as_example, axis=1).values)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e1464863-00af-4eb1-b3a6-874830462a7e",
   "metadata": {},
   "source": [
    "### Train and dev samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "8d6ae528-fea9-4692-92f5-275aa8989e01",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(200, 50)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_train = load_scone(\"ScoNe/scone_nli/train\")\n",
    "\n",
    "random.seed(1)\n",
    "random.shuffle(all_train)\n",
    "\n",
    "# 200 random train, 50 random dev:\n",
    "train, dev = all_train[: 200], all_train[200: 250]\n",
    "\n",
    "len(train), len(dev)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5fae6481-7f79-4616-9261-6c0590df636c",
   "metadata": {},
   "source": [
    "### Test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "89eec431-1e24-41d3-bb6a-0a36c8ed3e14",
   "metadata": {},
   "outputs": [],
   "source": [
    "random.seed(1)\n",
    "\n",
    "test = load_scone(dirname=f\"ScoNe/scone_nli/test\")\n",
    "\n",
    "# We're developing a system for the full ScoNe benchmark, but we'll\n",
    "# evaluate only on one of the hardest and most informative ScoNe\n",
    "# categories for now -- examples with a single negation that plays\n",
    "# a crucial role in the reasoning:\n",
    "test = [ex for ex in test if ex.category == \"one_scoped\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "24938d16-4ea9-4f5b-838a-db7546e34585",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "No     100\n",
       "Yes    100\n",
       "dtype: int64"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.Series([ex.answer for ex in test]).value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d103da69-86ef-4666-9bec-70473dd7055b",
   "metadata": {},
   "source": [
    "## Evaluation tools"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "81c207d3-8a47-4df7-b8e5-2b3182fe6c1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "scone_accuracy = dspy.evaluate.metrics.answer_exact_match"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "6380249a-b4cd-4910-a832-7ab7e9300e38",
   "metadata": {},
   "outputs": [],
   "source": [
    "evaluator = Evaluate(devset=test, num_threads=1, display_progress=True, display_table=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fef0552b-d5c5-4717-8b1d-c020f6e6eee0",
   "metadata": {},
   "source": [
    "## Zero-shot CoT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "d385da93-cda6-44b0-bae6-dca95501fb83",
   "metadata": {},
   "outputs": [],
   "source": [
    "class ScoNeSignature(dspy.Signature):\n",
    "    (\"\"\"You are given some context (a premise) and a question (a hypothesis). \"\"\"\n",
    "    \"\"\"You must indicate with Yes/No answer whether we can logically \"\"\"\n",
    "    \"\"\"conclude the hypothesis from the premise.\"\"\")\n",
    "\n",
    "    context = dspy.InputField()\n",
    "    question = dspy.InputField()\n",
    "    answer = dspy.OutputField(desc=\"Yes or No\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "882d8cbb-94d3-4058-9639-1bd8502acae5",
   "metadata": {},
   "outputs": [],
   "source": [
    "class ScoNeCoT(dspy.Module):\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "        self.generate_answer = dspy.ChainOfThought(ScoNeSignature)\n",
    "\n",
    "    def forward(self, context, question):\n",
    "        return self.generate_answer(context=context, question=question)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "398bdece-5a40-42fe-9250-6fe54346ff51",
   "metadata": {},
   "outputs": [],
   "source": [
    "cot_zeroshot = ScoNeCoT()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "f799ec4d-70ce-4e5c-a0d7-c349bdab33d5",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Average Metric: 100 / 200  (50.0): 100%|█████████████████████████| 200/200 [00:00<00:00, 733.75it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Metric: 100 / 200  (50.0%)\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "50.0"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator(cot_zeroshot, metric=scone_accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92fe5ded-3c8e-456c-be57-8ffa014d2920",
   "metadata": {},
   "source": [
    "## Optimized few-shot with bootstrapped demonstrations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "57b783f2-c848-4f7a-b5bf-91c723f7e5d0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Going to sample between 1 and 8 traces per predictor.\n",
      "Will attempt to train 10 candidate sets.\n"
     ]
    }
   ],
   "source": [
    "bootstrap_optimizer = BootstrapFewShotWithRandomSearch(\n",
    "    max_bootstrapped_demos=8,\n",
    "    max_labeled_demos=8,\n",
    "    num_candidate_programs=10,\n",
    "    num_threads=8,\n",
    "    metric=scone_accuracy,\n",
    "    teacher_settings=dict(lm=gpt4T))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "6809523b-2173-4e70-af1e-854d1328a3cb",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Average Metric: 24 / 50  (48.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1096.32it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Metric: 24 / 50  (48.0%)\n",
      "Score: 48.0 for set: [0]\n",
      "New best score: 48.0 for seed -3\n",
      "Scores so far: [48.0]\n",
      "Best score: 48.0\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Average Metric: 25 / 50  (50.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1034.71it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Metric: 25 / 50  (50.0%)\n",
      "Score: 50.0 for set: [8]\n",
      "New best score: 50.0 for seed -2\n",
      "Scores so far: [48.0, 50.0]\n",
      "Best score: 50.0\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  6%|███▎                                                         | 11/200 [00:00<00:00, 899.26it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Bootstrapped 8 full traces after 12 examples in round 0.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Average Metric: 27 / 50  (54.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1225.04it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Metric: 27 / 50  (54.0%)\n",
      "Score: 54.0 for set: [8]\n",
      "New best score: 54.0 for seed -1\n",
      "Scores so far: [48.0, 50.0, 54.0]\n",
      "Best score: 54.0\n",
      "Average of max per entry across top 1 scores: 0.54\n",
      "Average of max per entry across top 2 scores: 0.7\n",
      "Average of max per entry across top 3 scores: 0.76\n",
      "Average of max per entry across top 5 scores: 0.76\n",
      "Average of max per entry across top 8 scores: 0.76\n",
      "Average of max per entry across top 9999 scores: 0.76\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  4%|██▊                                                           | 9/200 [00:00<00:00, 815.06it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Bootstrapped 7 full traces after 10 examples in round 0.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Average Metric: 37 / 50  (74.0): 100%|█████████████████████████████| 50/50 [00:00<00:00, 884.47it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Metric: 37 / 50  (74.0%)\n",
      "Score: 74.0 for set: [8]\n",
      "New best score: 74.0 for seed 0\n",
      "Scores so far: [48.0, 50.0, 54.0, 74.0]\n",
      "Best score: 74.0\n",
      "Average of max per entry across top 1 scores: 0.74\n",
      "Average of max per entry across top 2 scores: 0.78\n",
      "Average of max per entry across top 3 scores: 0.86\n",
      "Average of max per entry across top 5 scores: 0.92\n",
      "Average of max per entry across top 8 scores: 0.92\n",
      "Average of max per entry across top 9999 scores: 0.92\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  2%|█▏                                                            | 4/200 [00:00<00:00, 309.09it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Bootstrapped 3 full traces after 5 examples in round 0.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Average Metric: 28 / 50  (56.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1111.93it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Metric: 28 / 50  (56.0%)\n",
      "Score: 56.0 for set: [8]\n",
      "Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0]\n",
      "Best score: 74.0\n",
      "Average of max per entry across top 1 scores: 0.74\n",
      "Average of max per entry across top 2 scores: 0.8\n",
      "Average of max per entry across top 3 scores: 0.82\n",
      "Average of max per entry across top 5 scores: 0.92\n",
      "Average of max per entry across top 8 scores: 0.92\n",
      "Average of max per entry across top 9999 scores: 0.92\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|▎                                                             | 1/200 [00:00<00:00, 712.23it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Bootstrapped 1 full traces after 2 examples in round 0.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Average Metric: 31 / 50  (62.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1043.32it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Metric: 31 / 50  (62.0%)\n",
      "Score: 62.0 for set: [8]\n",
      "Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0]\n",
      "Best score: 74.0\n",
      "Average of max per entry across top 1 scores: 0.74\n",
      "Average of max per entry across top 2 scores: 0.86\n",
      "Average of max per entry across top 3 scores: 0.9\n",
      "Average of max per entry across top 5 scores: 0.94\n",
      "Average of max per entry across top 8 scores: 0.94\n",
      "Average of max per entry across top 9999 scores: 0.94\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  2%|█▏                                                            | 4/200 [00:00<00:00, 837.65it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Bootstrapped 4 full traces after 5 examples in round 0.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Average Metric: 23 / 50  (46.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1104.00it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Metric: 23 / 50  (46.0%)\n",
      "Score: 46.0 for set: [8]\n",
      "Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0]\n",
      "Best score: 74.0\n",
      "Average of max per entry across top 1 scores: 0.74\n",
      "Average of max per entry across top 2 scores: 0.86\n",
      "Average of max per entry across top 3 scores: 0.9\n",
      "Average of max per entry across top 5 scores: 0.94\n",
      "Average of max per entry across top 8 scores: 0.96\n",
      "Average of max per entry across top 9999 scores: 0.96\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  2%|█▏                                                            | 4/200 [00:00<00:00, 802.55it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Bootstrapped 4 full traces after 5 examples in round 0.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Average Metric: 34 / 50  (68.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1116.66it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Metric: 34 / 50  (68.0%)\n",
      "Score: 68.0 for set: [8]\n",
      "Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0]\n",
      "Best score: 74.0\n",
      "Average of max per entry across top 1 scores: 0.74\n",
      "Average of max per entry across top 2 scores: 0.92\n",
      "Average of max per entry across top 3 scores: 0.98\n",
      "Average of max per entry across top 5 scores: 0.98\n",
      "Average of max per entry across top 8 scores: 1.0\n",
      "Average of max per entry across top 9999 scores: 1.0\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  2%|█▌                                                            | 5/200 [00:00<00:00, 855.28it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Bootstrapped 5 full traces after 6 examples in round 0.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Average Metric: 30 / 50  (60.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1148.03it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Metric: 30 / 50  (60.0%)\n",
      "Score: 60.0 for set: [8]\n",
      "Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0]\n",
      "Best score: 74.0\n",
      "Average of max per entry across top 1 scores: 0.74\n",
      "Average of max per entry across top 2 scores: 0.92\n",
      "Average of max per entry across top 3 scores: 0.98\n",
      "Average of max per entry across top 5 scores: 0.98\n",
      "Average of max per entry across top 8 scores: 1.0\n",
      "Average of max per entry across top 9999 scores: 1.0\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  1%|▌                                                             | 2/200 [00:00<00:00, 723.34it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Bootstrapped 2 full traces after 3 examples in round 0.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Average Metric: 27 / 50  (54.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1109.09it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Metric: 27 / 50  (54.0%)\n",
      "Score: 54.0 for set: [8]\n",
      "Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0, 54.0]\n",
      "Best score: 74.0\n",
      "Average of max per entry across top 1 scores: 0.74\n",
      "Average of max per entry across top 2 scores: 0.92\n",
      "Average of max per entry across top 3 scores: 0.98\n",
      "Average of max per entry across top 5 scores: 0.98\n",
      "Average of max per entry across top 8 scores: 1.0\n",
      "Average of max per entry across top 9999 scores: 1.0\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  3%|█▊                                                            | 6/200 [00:00<00:00, 828.15it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Bootstrapped 6 full traces after 7 examples in round 0.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Average Metric: 28 / 50  (56.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1036.51it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Metric: 28 / 50  (56.0%)\n",
      "Score: 56.0 for set: [8]\n",
      "Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0, 54.0, 56.0]\n",
      "Best score: 74.0\n",
      "Average of max per entry across top 1 scores: 0.74\n",
      "Average of max per entry across top 2 scores: 0.92\n",
      "Average of max per entry across top 3 scores: 0.98\n",
      "Average of max per entry across top 5 scores: 0.98\n",
      "Average of max per entry across top 8 scores: 1.0\n",
      "Average of max per entry across top 9999 scores: 1.0\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  2%|█▌                                                            | 5/200 [00:00<00:00, 790.78it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Bootstrapped 4 full traces after 6 examples in round 0.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Average Metric: 25 / 50  (50.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1128.36it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Metric: 25 / 50  (50.0%)\n",
      "Score: 50.0 for set: [8]\n",
      "Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0, 54.0, 56.0, 50.0]\n",
      "Best score: 74.0\n",
      "Average of max per entry across top 1 scores: 0.74\n",
      "Average of max per entry across top 2 scores: 0.92\n",
      "Average of max per entry across top 3 scores: 0.98\n",
      "Average of max per entry across top 5 scores: 0.98\n",
      "Average of max per entry across top 8 scores: 1.0\n",
      "Average of max per entry across top 9999 scores: 1.0\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  4%|██▍                                                           | 8/200 [00:00<00:00, 845.75it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Bootstrapped 8 full traces after 9 examples in round 0.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Average Metric: 31 / 50  (62.0): 100%|█████████████████████████████| 50/50 [00:00<00:00, 921.83it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Metric: 31 / 50  (62.0%)\n",
      "Score: 62.0 for set: [8]\n",
      "Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0, 54.0, 56.0, 50.0, 62.0]\n",
      "Best score: 74.0\n",
      "Average of max per entry across top 1 scores: 0.74\n",
      "Average of max per entry across top 2 scores: 0.92\n",
      "Average of max per entry across top 3 scores: 0.98\n",
      "Average of max per entry across top 5 scores: 0.98\n",
      "Average of max per entry across top 8 scores: 1.0\n",
      "Average of max per entry across top 9999 scores: 1.0\n",
      "13 candidate programs found.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "if RUN_FROM_SCRATCH:\n",
    "    cot_fewshot = bootstrap_optimizer.compile(cot_zeroshot, trainset=train, valset=dev)\n",
    "else:\n",
    "    cot_fewshot = ScoNeCoT()\n",
    "    cot_fewshot.load(\"scone-cot_fewshot-turbo-gpt4-demos.json\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "3256d089-d09c-440a-a059-0ec871728713",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Average Metric: 171 / 200  (85.5): 100%|█████████████████████████| 200/200 [00:00<00:00, 557.50it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average Metric: 171 / 200  (85.5%)\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "85.5"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator(cot_fewshot, metric=scone_accuracy)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "58730a3a-bd11-4e6a-9945-1800ee47232b",
   "metadata": {},
   "outputs": [],
   "source": [
    "cot_fewshot.save(\"scone-cot_fewshot-turbo-gpt4-demos.json\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f67c97cf-c089-49b9-b5ef-ac220d9caf7f",
   "metadata": {},
   "source": [
    "## Example prompt with prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "12b76b5f-5ca8-4de7-9afe-d8e6ab3174f9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\n",
      "\n",
      "You are given some context (a premise) and a question (a hypothesis). You must indicate with Yes/No answer whether we can logically conclude the hypothesis from the premise.\n",
      "\n",
      "---\n",
      "\n",
      "Follow the following format.\n",
      "\n",
      "Context: ${context}\n",
      "\n",
      "Question: ${question}\n",
      "\n",
      "Reasoning: Let's think step by step in order to ${produce the answer}. We ...\n",
      "\n",
      "Answer: Yes or No\n",
      "\n",
      "---\n",
      "\n",
      "Context: It is not true that there is not a single person walking in the city.\n",
      "\n",
      "Question: Can we logically conclude for sure that it is not true that there is not a single celebrity walking in the city?\n",
      "\n",
      "Reasoning: Let's think step by step in order to produce the answer. We know that the double negative in the context implies that there is at least one person walking in the city. However, the context does not provide any information about the status or occupation of the person walking in the city. Therefore, we cannot logically conclude that the person walking in the city is a celebrity.\n",
      "\n",
      "Answer: No\n",
      "\n",
      "---\n",
      "\n",
      "Context: the boy, not girl, will play an trombone, but not for another week\n",
      "\n",
      "Question: Can we logically conclude for sure that the boy, not girl, will play an instrument, but not for another week?\n",
      "\n",
      "Reasoning: Let's think step by step in order to produce the answer. We know that the boy will play a trombone, which is a type of instrument. The context specifies that this will happen not for another week, which means it will happen in the future, but not immediately. The gender of the person is also specified as a boy, not a girl.\n",
      "\n",
      "Answer: Yes\n",
      "\n",
      "---\n",
      "\n",
      "Context: A man is not holding anything in his hands.\n",
      "\n",
      "Question: Can we logically conclude for sure that a man is not holding beverages in his hands?\n",
      "\n",
      "Reasoning: Let's think step by step in order to produce the answer. We know that the man is not holding anything in his hands. Beverages are a subset of \"anything.\" Therefore, if he is not holding anything, he is also not holding beverages.\n",
      "\n",
      "Answer: Yes\n",
      "\n",
      "---\n",
      "\n",
      "Context: There is not a boat nearby.\n",
      "\n",
      "Question: Can we logically conclude for sure that there is not a speedboat nearby?\n",
      "\n",
      "Reasoning: Let's think step by step in order to produce the answer. We know that there is not a boat nearby. A speedboat is a type of boat. If there are no boats nearby, then there cannot be a speedboat nearby either, because a speedboat is included in the category of boats.\n",
      "\n",
      "Answer: Yes\n",
      "\n",
      "---\n",
      "\n",
      "Context: The man is not listening to music.\n",
      "\n",
      "Question: Can we logically conclude for sure that the man is not listening to rockabilly?\n",
      "\n",
      "Reasoning: Let's think step by step in order to produce the answer. We know that the man is not listening to music. Rockabilly is a genre of music. If he is not listening to music at all, then he cannot be listening to rockabilly, which is a specific type of music.\n",
      "\n",
      "Answer: Yes\n",
      "\n",
      "---\n",
      "\n",
      "Context: There is not a single person walking in the city.\n",
      "\n",
      "Question: Can we logically conclude for sure that there is not a single mover walking in the city?\n",
      "\n",
      "Reasoning: Let's think step by step in order to produce the answer. We know that there is not a single person walking in the city. A mover is a type of person. If there are no people walking in the city, then there cannot be a mover walking in the city either, because a mover is included in the broader category of people.\n",
      "\n",
      "Answer: Yes\n",
      "\n",
      "---\n",
      "\n",
      "Context: a man walks not so confidently, but he does not drop eggplants\n",
      "\n",
      "Question: Can we logically conclude for sure that a man walks not so confidently, but he does not drop produce?\n",
      "\n",
      "Reasoning: Let's think step by step in order to produce the answer. We know that the man does not drop eggplants, which are a type of produce. However, the statement does not provide information about other types of produce. Therefore, we cannot conclude that he does not drop any type of produce, only that he does not drop eggplants specifically.\n",
      "\n",
      "Answer: No\n",
      "\n",
      "---\n",
      "\n",
      "Context: The girl who is not here is not wearing any jewelry at all.\n",
      "Question: Can we logically conclude for sure that the girl who is not here is not wearing any ringlets at all?\n",
      "Answer: Yes\n",
      "\n",
      "---\n",
      "\n",
      "Context: The man is not steering a sedan\n",
      "\n",
      "Question: Can we logically conclude for sure that the man is not steering a car?\n",
      "\n",
      "Reasoning: Let's think step by step in order to\u001b[32m produce the answer. We know that the man is not steering a sedan, which is a specific type of car. However, the context does not provide any information about whether he is steering any other type of car. Therefore, we cannot logically conclude that he is not steering a car at all.\n",
      "\n",
      "Answer: No\u001b[0m\n",
      "\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "turbo.inspect_history(n=1)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
